This tutorial walks new users through some of the steps needed to explore Whole Genome (WGS) and Whole Exome (WES) sequencing data generated from the 10x Genomics Chromium platform and the Longranger pipeline. We will investigate the Linked-Read data using a variety of tools, all of which are freely available from 10x Genomics or elsewhere.
Things to know about this workshop
| Username | IP | Terminal | RStudio | Download Files |
|---|---|---|---|---|
| testuser1 | 18.233.155.236 | <a href='http://18.233.155.236:8888/terminals/1' target='_blank'>terminal</a> | <a href='http://18.233.155.236:8787' target='_blank'>rstudio</a> | <a href='http://18.233.155.236' target='_blank'>download files</a> |
| testuser2 | 54.236.82.65 | <a href='http://54.236.82.65:8888/terminals/1' target='_blank'>terminal</a> | <a href='http://54.236.82.65:8787' target='_blank'>rstudio</a> | <a href='http://54.236.82.65' target='_blank'>download files</a> |
| testuser3 | 52.87.214.218 | <a href='http://52.87.214.218:8888/terminals/1' target='_blank'>terminal</a> | <a href='http://52.87.214.218:8787' target='_blank'>rstudio</a> | <a href='http://52.87.214.218' target='_blank'>download files</a> |
| testuser4 | 52.90.195.255 | <a href='http://52.90.195.255:8888/terminals/1' target='_blank'>terminal</a> | <a href='http://52.90.195.255:8787' target='_blank'>rstudio</a> | <a href='http://52.90.195.255' target='_blank'>download files</a> |
| testuser5 | 54.197.18.212 | <a href='http://54.197.18.212:8888/terminals/1' target='_blank'>terminal</a> | <a href='http://54.197.18.212:8787' target='_blank'>rstudio</a> | <a href='http://54.197.18.212' target='_blank'>download files</a> |
IGV is one of the most commonly used tools in genomics for viewing a variety of data types. If you do not have IGV, or don’t have the latest version (2.4), please download it from https://software.broadinstitute.org/software/igv/download
First, open IGV and load the 10x data. There are two .bam files that can be explored. Here’s a snapshot of the 10x-bam-files directory:
ubuntu@ip-172-31-63-156:~/10x-bam-files$ ls -lh /home/ubuntu/10x-bam-files
total 1.3G
-rw-rw-r-- 1 ubuntu ubuntu 64M Apr 2 17:37 NA12878_chr21_phased_possorted_exome_bam.bam
-rw-rw-r-- 1 ubuntu ubuntu 83K Apr 2 17:37 NA12878_chr21_phased_possorted_exome_bam.bam.bai
-rw-rw-r-- 1 ubuntu ubuntu 1.2G Apr 2 17:41 NA12878_chr21_phased_possorted_WGS_bam.bam
-rw-rw-r-- 1 ubuntu ubuntu 116K Apr 2 17:37 NA12878_chr21_phased_possorted_WGS_bam.bam.bai
-rw-rw-r-- 1 ubuntu ubuntu 891 Apr 2 17:44 README.md
To load one of the .bam files, follow these steps.
Depending on which view you are in, you might see reads paired in a variety of different ways. To show some of the special features of the 10x data:
This orders the reads by barcode and, if possible, phases the region that you are investigating. Groups of reads are “linked” to each other by the barcode associated with the single molecule that the reads originated from. The reads and barcodes are also separated into phased haplotypes 1 (red) and 2 (blue). Reads that could not be phased are represented by grey lines. These unphased reads are still useful and are utilized in most steps of Longranger.
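To make the grouping concrete, here is a minimal sketch of what "order by barcode, split by haplotype" amounts to. It is not how IGV implements it. The read names, the third barcode, and the haplotype labels are invented for illustration, and the sketch assumes each phased read carries a haplotype label of 1 or 2 (None when the read could not be phased).

```python
from collections import defaultdict

# Toy records: (read name, barcode, haplotype). Values are illustrative only;
# haplotype is None for reads that could not be phased.
reads = [
    ("read1", "GACACATCAGCTGTTA-1", 1),
    ("read2", "GACACATCAGCTGTTA-1", 1),
    ("read3", "AAGGGTGAGGCATGGT-1", 2),
    ("read4", "TTCGAAGCATCCTGGG-1", None),
]

def group_linked_reads(records):
    """Group read names first by haplotype, then by barcode."""
    groups = defaultdict(lambda: defaultdict(list))
    for name, barcode, haplotype in records:
        key = haplotype if haplotype is not None else "unphased"
        groups[key][barcode].append(name)
    # Convert to plain dicts so the result prints and compares cleanly.
    return {hap: dict(by_barcode) for hap, by_barcode in groups.items()}

grouped = group_linked_reads(reads)
for hap, by_barcode in grouped.items():
    for barcode, names in by_barcode.items():
        print(hap, barcode, names)
```

Reads that share a barcode end up in the same list, which is the "linking" that the viewer draws, and the unphased group corresponds to the grey reads.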
Some things to keep in mind when thinking about 10x data
The Loupe Browser is a 10x-specific genome browser that more fully captures some of the enhanced information that Linked-Reads provide in your WGS or WES experiments. Loupe is fully integrated into the Longranger pipeline, and .loupe files are automatically generated by default.
Our workshop has the Loupe browser set up at the address http://34.205.68.94:3000/loupe/
If you look at the 10x-loupe-files directory you can see the loupe files available to explore.
ubuntu@ip-172-31-63-156:~/10x-loupe-files$ ls -lh /home/ubuntu/10x-loupe-files
total 422M
-rw-rw-r-- 1 ubuntu ubuntu 40M Dec 8 09:02 LungTumorT.loupe
-rw-rw-r-- 1 ubuntu ubuntu 383M Apr 10 20:05 NA12878_exome.loupe
First, let’s go to http://34.205.68.94:3000/loupe/ and click on NA12878_exome.loupe. This brings us to the main page of Loupe.
As you can see, we have some nice statistics about the performance of our sequencing experiment, including:
Let’s navigate to chr17:41,074,530-41,399,282
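The coordinates above use the common chrom:start-end form with thousands separators. As a rough sketch (the helper name `parse_locus` is made up for this example and is not part of any 10x tool), the string can be split into its parts like this:

```python
def parse_locus(text):
    """Split 'chrom:start-end' (commas allowed) into (chrom, start, end)."""
    chrom, span = text.strip().split(":")
    start_text, end_text = span.split("-")
    start = int(start_text.replace(",", ""))
    end = int(end_text.replace(",", ""))
    if start > end:
        raise ValueError(f"start {start} is greater than end {end}")
    return chrom, start, end

chrom, start, end = parse_locus("chr17:41,074,530-41,399,282")
print(chrom, start, end, end - start)
```

The comma-free form, `f"{chrom}:{start}-{end}"`, is usually the safest one to paste into other tools.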
There are a lot of “clickable” features to see here:
Here you can see a similar output to that of IGV, but it is a bit more digestible for viewing linked reads.
Once again, we can see reads clearly phased by haplotype and reads that do not get phased in grey.
If we open up the http://34.205.68.94:3000/loupe/load?file=NA12878_wgs.loupe file and navigate to chr2:34,595,838-34,795,838;chr2:34,636,560-34,836,560 we can very clearly see a hemizygous deletion in both the
and the
There are a few things that make the 10x VCF unique. Overall, 10x abides by the VCF 4.x standard. However, there is some additional information that takes advantage of the 10x-specific technology. Documents covering the 10x VCF spec can be found here
If we navigate to a 10x VCF and have a look:
cd /home/ubuntu/10x-vcf-files
zcat NA12878_exome_varcalls.vcf.gz | less
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 47669
chr1 10616 . CCGCCGTTGCAAAGGCGCGCCG C 54.7 PASS AC=2;AF=1.0;AN=2;DP=7;ExcessHet=3.0103;FS=0.0;MLEAC=2;MLEAF=1.0;MQ=54.51;QD=27.35;SOR=2.303;POSTHPC=1;POSTHPB=G;MUMAP_REF=20.0;MUMAP_ALT=42.5;AO=6;RO=0;MMD=-1.0;RESCUED=0;NOT_RESCUED=9;POSTDNB=GC;POSTDNC=1;POSTTNB=GCC;POSTTNC=1;HAPLOCALLED=0 GT:AD:DP:GQ:PL:BX:PS 1|1:0,2:2:6:91,6,0:,CCGTACTGTTGTGTCA-1_43;CCATGTCGTTTAAGCC-1_43;GGACTTACAGATCGGA-1_43_43;ATCATGGGTAACTTCG-1_43;TCAGATGAGTGAGAAG-1_43:10616
chr1 11457 . C G 29.77 10X_ALLELE_FRACTION_FILTER AC=1;AF=0.5;AN=2;BaseQRankSum=0.0;ClippingRankSum=0.0;DP=4;ExcessHet=3.0103;FS=0.0;MLEAC=1;MLEAF=0.5;MQ=37.59;MQRankSum=-1.383;QD=7.44;ReadPosRankSum=-0.674;SOR=0.693;MUMAP_REF=11.7273;MUMAP_ALT=30.0;AO=1;RO=2;MMD=3.5;RESCUED=1;NOT_RESCUED=12;HAPLOCALLED=0 GT:AD:DP:GQ:PL:BX:PS 0/1:2,2:4:58:58,0,150:AGTGGGAAGGTTAGTA-1_74;AAGTGGGCAAAGCAAT-1_74,TGTGGGCTCTAGAGTC-1_70:11457
chr1 11803 . T C 14.91 10X_QUAL_FILTER;10X_ALLELE_FRACTION_FILTER AC=1;AF=0.5;AN=2;BaseQRankSum=-2.2;ClippingRankSum=0.0;DP=9;ExcessHet=3.0103;FS=0.0;MLEAC=1;MLEAF=0.5;MQ=44.77;MQRankSum=-2.2;QD=1.66;ReadPosRankSum=-1.383;SOR=0.223;MUMAP_REF=13.2368;MUMAP_ALT=3.57143;AO=0;RO=9;MMD=2.71429;RESCUED=0;NOT_RESCUED=45;HAPLOCALLED=0 GT:AD:DP:GQ:PL:BX:PS 0/1:7,2:9:43:43,0,334:GCTAGCGAGTTGAGAT-1_74_65;GCCAAATGTAGGTCGA-1_70;GAGTCCGTCCGCATCT-1_45;TGCATAGAGACTTCTG-1_74;CTGTTGCAGACCATAA-1_74;CTCAAAGGTCATGTAC-1_74;GAGATGGAGGTCAGAC-1_74_74,:11803
chr1 11863 . C A 59.77 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=-0.319;ClippingRankSum=0.0;DP=4;ExcessHet=3.0103;FS=0.0;MLEAC=1;MLEAF=0.5;MQ=50.02;MQRankSum=-1.15;QD=14.94;ReadPosRankSum=0.319;SOR=0.916;MUMAP_REF=9.3421;MUMAP_ALT=17.2222;AO=3;RO=6;MMD=1.34797;RESCUED=0;NOT_RESCUED=47;HAPLOCALLED=1 GT:AD:DP:GQ:PL:BX:PS:PQ:JQ 0|1:1,3:4:33:88,0,33:GCTAGCGAGTTGAGAT-1_70;ATCATGGCATCTATGG-1_45;TGCATAGAGACTTCTG-1_74;CTCAAAGGTCATGTAC-1_74;GAGATGGAGGTCAGAC-1_70,GACACATGTTGTGGCC-1_45;TTCGAAGCATCCTGGG-1_74;GCCAAATGTAGGTCGA-1_74:1:25:25
chr1 11921 . T C 12.05 10X_QUAL_FILTER;10X_ALLELE_FRACTION_FILTER AC=1;AF=0.5;AN=2;BaseQRankSum=0.493;ClippingRankSum=0.0;DP=10;ExcessHet=3.0103;FS=3.31;MLEAC=1;MLEAF=0.5;MQ=48.28;MQRankSum=0.0;QD=1.2;ReadPosRankSum=-0.431;SOR=2.303;MUMAP_REF=11.1489;MUMAP_ALT=18.8;AO=1;RO=9;MMD=2.60317;RESCUED=0;NOT_RESCUED=52;HAPLOCALLED=0 GT:AD:DP:GQ:PL:BX:PS 0/1:8,2:10:40:40,0,311:GAGTCCGTCCGCATCT-1_74;CTACTTAGTGTGGCTC-1_74;ACGGAGATCCTTGGTC-1_74;TGCATAGAGACTTCTG-1_74;TTCGAAGCATCCTGGG-1_74;GAGATGGAGGTCAGAC-1_55;AGCTTCCGTTACGCCG-1_74;GACACATGTTGTGGCC-1_70,CGACTTCTCAACAGTC-1_74:11921
Not only do you see some of the typical things
You can also see some of the extra 10x-specific information, mostly in the FORMAT field.
In the BX field, barcodes separated by ; cover the first allele, followed by a , which separates them from the barcodes associated with reads covering the second allele.
This extra information can be very useful for determining whether variants are in cis or in trans, which is especially helpful when looking at compound heterozygous variants. All the alleles on one side of the separator (|) with the same PS are from the same haplotype.
Note: For GT, | represents a phased variant and / represents an unphased variant.
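The layout of the sample column can be sketched in a few lines of Python. The record below is the chr1:11863 call from the excerpt above (FORMAT and sample columns only). The function name `parse_sample` is hypothetical, and treating the numbers after each underscore as per-read quality values that can be dropped is an assumption of this sketch.

```python
# FORMAT and sample columns of the chr1:11863 record shown above.
fmt = "GT:AD:DP:GQ:PL:BX:PS:PQ:JQ"
sample = (
    "0|1:1,3:4:33:88,0,33:"
    "GCTAGCGAGTTGAGAT-1_70;ATCATGGCATCTATGG-1_45;TGCATAGAGACTTCTG-1_74;"
    "CTCAAAGGTCATGTAC-1_74;GAGATGGAGGTCAGAC-1_70,"
    "GACACATGTTGTGGCC-1_45;TTCGAAGCATCCTGGG-1_74;GCCAAATGTAGGTCGA-1_74"
    ":1:25:25"
)

def parse_sample(fmt, sample):
    """Pair FORMAT keys with sample values and unpack GT, PS and BX."""
    fields = dict(zip(fmt.split(":"), sample.split(":")))
    genotype = fields["GT"]
    per_allele = []
    # BX holds one comma-separated entry per allele; inside an entry the
    # barcodes are separated by semicolons.
    for entry in fields.get("BX", "").split(","):
        # Assumption: the suffix after the first underscore is a quality value.
        per_allele.append([item.split("_")[0] for item in entry.split(";") if item])
    return {
        "GT": genotype,
        "phased": "|" in genotype,  # '|' means phased, '/' means unphased
        "PS": fields.get("PS"),
        "barcodes": per_allele,
    }

call = parse_sample(fmt, sample)
print(call["GT"], call["phased"], call["PS"])
for allele_index, barcodes in enumerate(call["barcodes"]):
    print(allele_index, barcodes)
```

Two phased calls with the same PS value can then be compared directly: alleles on the same side of the | are on the same haplotype.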
The 10x/Linked-Read .bam file contains much of the same information that a typical short-read .bam would but, like the VCF, has some extra information. Documents covering the 10x .bam spec can be found here
If we take a look we can see some interesting features:
cd /home/ubuntu/10x-bam-files
samtools view -h NA12878_chr21_phased_possorted_exome_bam.bam | less
@PG ID:lariat PN:longranger.lariat CL:lariat -reads=/mnt/analysis/marsoc/pipestances/HGKNJBBXX/PHASER_SVCALLER_EXOME_PD/49255/1016.1.1-0/PHASER_SVCALLER_EXOME_PD/PHASER_SVCALLER_EXOME/_LINKED_READS_ALIGNER/_FASTQ_PREP_NEW/SORT_FASTQS/fork0/join-u29c07c9de1/files/chunk-0.fasth.gz -read_groups=49255:MissingLibrary:1:unknown_fc:0 -genome=/mnt/opt/refdata_new/hg19-2.0.0/fasta/genome.fa -sample_id=49255 -threads=4 -centromeres=/mnt/opt/refdata_new/hg19-2.0.0/regions/centromeres.tsv -trim_length=7 -first_chunk=True -output=/mnt/analysis/marsoc/pipestances/HGKNJBBXX/PHASER_SVCALLER_EXOME_PD/49255/1016.1.1-0/PHASER_SVCALLER_EXOME_PD/PHASER_SVCALLER_EXOME/_LINKED_READS_ALIGNER/BARCODE_AWARE_ALIGNER/fork0/chnk000-u29c07c9e62/files VN:'576387f'
@PG ID:attach_phasing PN:longranger.attach_phasing PP:lariat VN:1016.1.1
@PG ID:longranger PN:longranger PP:attach_phasing VN:1016.1.1
@CO 10x_bam_to_fastq:R1(RX:QX,TR:TQ,SEQ:QUAL)
@CO 10x_bam_to_fastq:R2(SEQ:QUAL)
@CO 10x_bam_to_fastq:I1(BC:QT)
ST-K00126:334:HGKNJBBXX:4:2118:26920:14519 163 chr21 9412940 39 92M8S = 9412953 90 GGAGTTGTATTGGTGCAGGAAGGGGAGTTTGATTTAATGAAACAATGCATTAAAAATTTGTATTCACTTTGTGATTCAATGATAGTCAATGTTAACATAA AAA<FAJFFJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
JJJJJJJJFAFF<FJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ RX:Z:GACACATCAGCTGTTA QX:Z:AAFFFJJJJJJJJJJJ BC:Z:TCTCGGGC QT:Z:AAFFFJJJ XS:i:-13 AS:i:-9 XM:A:0 AM:A:0 XT:i:0 BX:Z:GACACATCAGCTGTTA-1 RG:Z:49255:MissingLibrary:1:unknown_fc:0 OM:i:39
ST-K00126:334:HGKNJBBXX:4:2118:26920:14519 83 chr21 9412953 39 77M = 9412940 -90 TGCAGGAAGGGGAGTTTGATTTAATGAAACAATGCATTAAAAATTTGTATTCACTTTGTGATTCAATGATAGTCAAT FJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJ
RX:Z:GACACATCAGCTGTTA QX:Z:AAFFFJJJJJJJJJJJ TR:Z:TGTTAAC TQ:Z:JJJJJJJ BC:Z:TCTCGGGC QT:Z:AAFFFJJJ XS:i:-13 AS:i:-9 XM:A:0 AM:A:0 XT:i:0 BX:Z:GACACATCAGCTGTTA-1 RG:Z:49255:MissingLibrary:1:unknown_fc:0 OM:i:39
ST-K00126:334:HGKNJBBXX:4:1114:32644:45010 99 chr21 9413248 39 77M = 9413263 115 TGAATATTTTCTCAGCAACTGTGGTGTTATGATATATATTGGTTTTCATCCCCAGTTCCTGGCTTATAACTCCCCTA FF<J<FJJJ-J<JFAJFJJAJ-A-<7<A--FJ-AJJJFFFFJF-<FFF-F--7A<FF-<AF<JA-A-JJ-<<7FFF<
RX:Z:NAGGGTGAGGCATGGT QX:Z:#<<AAFFJJFJJA<J< TR:Z:TTCCGCA TQ:Z:<FJJJFA BC:Z:TCTCGGGC QT:Z:AAAFFJJJ XS:i:-12 AS:i:-8 XM:A:0 AM:A:0 XT:i:0 BX:Z:AAGGGTGAGGCATGGT-1 RG:Z:49255:MissingLibrary:1:unknown_fc:0 OM:i:39
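The three @CO header lines describe how the original FASTQ reads map onto BAM fields: read 1 is the raw barcode (RX), then the trimmed bases (TR), then the aligned sequence (SEQ). The following is a minimal sketch of that reassembly for the reverse-strand read shown above (flag 83). It assumes that SEQ has to be reverse-complemented when flag bit 0x10 is set, since SAM stores reverse-strand reads in forward orientation. The function names are illustrative only, and this is meant to make the header comments easier to read, not to replace a real BAM-to-FASTQ converter.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def rebuild_r1(flag, seq, rx, tr=""):
    """Reassemble read 1 bases as RX + TR + SEQ in original read orientation."""
    # Flag bit 0x10 means the read aligned to the reverse strand.
    original = reverse_complement(seq) if flag & 0x10 else seq
    return rx + tr + original

# SEQ, RX and TR copied from the second alignment record shown above.
seq = "TGCAGGAAGGGGAGTTTGATTTAATGAAACAATGCATTAAAAATTTGTATTCACTTTGTGATTCAATGATAGTCAAT"
r1 = rebuild_r1(83, seq, rx="GACACATCAGCTGTTA", tr="TGTTAAC")
print(len(r1), r1[:30])
```

The quality string would be rebuilt the same way from QX, TQ and the reversed QUAL. Note that the 16-base barcode plus the 7 trimmed bases (compare `-trim_length=7` in the @PG line) account for the difference in length between this read and its 100-base mate.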
Things to look for:
BX:Z:GACACATCAGCTGTTA-1
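To read these records programmatically, the optional fields can be turned into a dictionary and the flag into a list of names. The sketch below uses tag text copied from the records above. The function names are hypothetical, and the interpretation of RX as the barcode as sequenced and BX as the corrected barcode is an assumption based on the third read, where RX starts with N and BX starts with A.

```python
# Subset of SAM flag bits (from the SAM specification).
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
}

def decode_flag(flag):
    """List the names of the flag bits that are set."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

def parse_tags(text):
    """Turn 'TAG:TYPE:VALUE' tokens into a dict, converting type i to int."""
    tags = {}
    for token in text.split():
        name, kind, value = token.split(":", 2)  # values may contain ':'
        tags[name] = int(value) if kind == "i" else value
    return tags

def barcode_was_corrected(tags):
    """True when the raw barcode differs from the BX barcode (without '-1')."""
    return tags["RX"] != tags["BX"].split("-")[0]

second_read = parse_tags(
    "RX:Z:GACACATCAGCTGTTA QX:Z:AAFFFJJJJJJJJJJJ TR:Z:TGTTAAC TQ:Z:JJJJJJJ "
    "BC:Z:TCTCGGGC QT:Z:AAFFFJJJ XS:i:-13 AS:i:-9 XM:A:0 AM:A:0 XT:i:0 "
    "BX:Z:GACACATCAGCTGTTA-1 RG:Z:49255:MissingLibrary:1:unknown_fc:0 OM:i:39"
)
third_read = parse_tags("RX:Z:NAGGGTGAGGCATGGT BX:Z:AAGGGTGAGGCATGGT-1")

print(decode_flag(83))
print(second_read["BX"], barcode_was_corrected(second_read))
print(third_read["BX"], barcode_was_corrected(third_read))
```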
Informatics Tip: if you’d like to search for all the reads associated with a list of barcodes, this is a fast way to do it (requires ripgrep):
samtools view -@ 5 possorted_exome_bam.bam | rg -j 5 --no-line-number -F -f BX_list.txt > BC_reads.sam
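The command above matches fixed strings anywhere in each line. For comparison, here is a slower but more explicit sketch that keeps only reads whose BX tag is in the barcode list. It assumes the list holds one barcode per line in the same form as the BX value (for example GACACATCAGCTGTTA-1). The function names and the toy SAM lines are invented for this example.

```python
def load_barcodes(lines):
    """Build a set of barcodes from an iterable of text lines."""
    return {line.strip() for line in lines if line.strip()}

def filter_by_barcode(sam_lines, barcodes):
    """Yield SAM records whose BX:Z tag value is in the barcode set."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        # Optional tags start at the 12th tab-separated column.
        for field in line.rstrip("\n").split("\t")[11:]:
            if field.startswith("BX:Z:") and field[5:] in barcodes:
                yield line
                break

# Toy input, not real data: one header line and three minimal records.
toy_sam = [
    "@CO\texample header\n",
    "r1\t99\tchr21\t100\t60\t5M\t=\t200\t105\tACGTA\tJJJJJ\tBX:Z:GACACATCAGCTGTTA-1\n",
    "r2\t99\tchr21\t300\t60\t5M\t=\t400\t105\tACGTA\tJJJJJ\tBX:Z:AAGGGTGAGGCATGGT-1\n",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tJJJJJ\tRX:Z:GACACATCAGCTGTTA\n",
]

wanted = load_barcodes(["GACACATCAGCTGTTA-1\n", "\n"])
kept = list(filter_by_barcode(toy_sam, wanted))
for record in kept:
    print(record.split("\t")[0])
```

In practice the input lines could come from piping `samtools view` into a script that reads standard input. Either way, the result has no header, so it would need one added before tools that expect a complete SAM file can read it.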
All 10x-specific software and information about 10x-specific file formats can be found here
To get into the Longranger environment, source the following script:
source /opt/longranger-2.2.2/sourceme.bash